
feat: add local llama-cpp embedding support #1388

Merged
MaojiaSheng merged 2 commits into volcengine:main from Mijamind719:embedding_local
Apr 15, 2026

Conversation

Collaborator

@Mijamind719 Mijamind719 commented Apr 12, 2026

Description

This PR extends the openviking-server init interactive setup wizard to support a second local embedding path based on llama-cpp-python, while preserving the existing Ollama-based local setup flow.

In other words, this PR follows the "route 2" direction we discussed:

  • keep the Ollama setup path
  • add a parallel llama-cpp-python setup path
  • do not replace or remove Ollama
  • guide users to install local embedding dependencies through the optional extra dependency path

This PR builds on top of the existing local embedding runtime support already present on main; it mainly focuses on setup UX, config generation, and a config initialization deadlock fix.

Related Issue

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Refactoring (no functional changes)
  • Performance improvement
  • Test update

Changes Made

  • Extend openviking-server init with a new llama.cpp / GGUF local embedding path.
  • Keep the existing Ollama setup path as a separate option instead of replacing it.
  • Add a local GGUF preset for bge-small-zh-v1.5-f16.
  • Add interactive checks for llama-cpp-python, with guided installation via:
    • pip install "openviking[local-embed]"
  • Add local config generation for:
    • builtin GGUF preset
    • optional cache dir
    • optional VLM config
  • In the llama.cpp setup path, allow three VLM choices:
    • Ollama VLM
    • Cloud API VLM
    • Skip VLM
  • Add lazy imports in openviking/storage/__init__.py to avoid import-time deadlock when the native engine extension is loaded.
  • Fix re-entrant config singleton initialization in OpenVikingConfigSingleton.
  • Add and update setup wizard tests for the new local config path.

Why This Matches Route 2

This PR matches the route-2 direction because it does not collapse everything into a single Ollama-only or llama-cpp-only flow.

Instead, it keeps both local paths available:

  • Ollama remains the service-style local model option
  • llama-cpp-python becomes the direct GGUF local embedding option

That is the core product shape we agreed on for route 2.

Testing

  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have tested this on the following platforms:
    • Linux
    • macOS
    • Windows

Validation performed:

  • python3 -m py_compile openviking/storage/__init__.py openviking_cli/setup_wizard.py openviking_cli/utils/config/open_viking_config.py tests/cli/test_setup_wizard.py
  • PYTHONPATH=. ./.venv/bin/python -m pytest -q tests/cli/test_setup_wizard.py --maxfail=1
  • PYTHONPATH=. ./.venv/bin/python -m pytest -q tests/cli/test_doctor.py --maxfail=1

Known Limitations

  • This PR focuses on setup/config UX, not on reworking the embedding runtime itself.
  • The wizard can generate config even if the local dependency install step is skipped or fails; actual runtime validation still happens later during startup/doctor checks.
  • The builtin preset list is intentionally small for now and currently includes bge-small-zh-v1.5-f16 only.

Checklist

  • My code follows the project's coding style
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Screenshots (if applicable)

N/A

Additional Notes

The old PR description referred to the earlier local embedding runtime implementation. This has been updated to reflect the current scope of the branch, which is now centered on the setup wizard, route-2 local setup shape, and the config/import fixes.

@github-actions

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🏅 Score: 70
🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 Multiple PR themes

Sub-PR theme: Add embedding metadata validation to collection initialization

Relevant files:

  • openviking/storage/collection_schemas.py
  • openviking/storage/errors.py
  • openviking/storage/viking_vector_index_backend.py
  • openviking/storage/vikingdb_manager.py
  • tests/storage/test_collection_schemas.py

Sub-PR theme: Add local llama-cpp embedding support

Relevant files:

  • openviking/models/embedder/__init__.py
  • openviking/models/embedder/base.py
  • openviking/models/embedder/local_embedders.py
  • openviking_cli/doctor.py
  • openviking_cli/utils/config/embedding_config.py
  • tests/cli/test_doctor.py
  • tests/misc/test_config_validation.py
  • tests/unit/test_local_embedder.py
  • pyproject.toml

⚡ Recommended focus areas for review

Backward Compatibility Break

The init_context_collection function now raises EmbeddingConfigurationError when the storage backend does not implement get_collection_meta, breaking existing deployments that use backends without this method. Previously, the function simply returned False when the collection already existed.

existing_meta = None
if hasattr(storage, "get_collection_meta"):
    existing_meta = await storage.get_collection_meta()

if not existing_meta:
    raise EmbeddingConfigurationError(
        "Existing collection metadata is unavailable; cannot validate embedding compatibility"
    )
Blocking Async Operations

The LocalDenseEmbedder does not override the base class async methods (embed_async, embed_batch_async). The base class default implementation may not properly offload the blocking llama-cpp operations to a thread pool, potentially starving the async event loop.

class LocalDenseEmbedder(DenseEmbedderBase):
    """Dense embedder backed by a local GGUF model via llama-cpp-python."""

    def __init__(
        self,
        model_name: str = DEFAULT_LOCAL_DENSE_MODEL,
        model_path: Optional[str] = None,
        cache_dir: Optional[str] = None,
        dimension: Optional[int] = None,
        query_instruction: Optional[str] = None,
        config: Optional[Dict[str, Any]] = None,
    ):
        runtime_config = dict(config or {})
        runtime_config.setdefault("provider", "local")
        super().__init__(model_name, runtime_config)

        self.model_spec = get_local_model_spec(model_name)
        self.model_path = model_path
        self.cache_dir = cache_dir or DEFAULT_LOCAL_MODEL_CACHE_DIR
        self.query_instruction = (
            query_instruction
            if query_instruction is not None
            else self.model_spec.query_instruction
        )
        self._dimension = dimension or self.model_spec.dimension
        if self._dimension != self.model_spec.dimension:
            raise ValueError(
                f"Local model '{model_name}' has fixed dimension {self.model_spec.dimension}, "
                f"but got dimension={self._dimension}"
            )

        self._resolved_model_path = self._resolve_model_path()
        self._llama = self._load_model()

    def _import_llama(self):
        try:
            module = importlib.import_module("llama_cpp")
        except ImportError as exc:
            raise EmbeddingConfigurationError(
                "Local embedding is enabled but 'llama-cpp-python' is not installed. "
                'Install it with: pip install "openviking[local-embed]". '
                "If you prefer a remote provider, set embedding.dense.provider explicitly in ov.conf."
            ) from exc

        llama_cls = getattr(module, "Llama", None)
        if llama_cls is None:
            raise EmbeddingConfigurationError(
                "llama_cpp.Llama is unavailable in the installed llama-cpp-python package."
            )
        return llama_cls

    def _resolve_model_path(self) -> Path:
        if self.model_path:
            resolved = Path(self.model_path).expanduser().resolve()
            if not resolved.exists():
                raise EmbeddingConfigurationError(
                    f"Local embedding model file not found: {resolved}"
                )
            return resolved

        cache_root = Path(self.cache_dir).expanduser().resolve()
        cache_root.mkdir(parents=True, exist_ok=True)
        target = get_local_model_cache_path(self.model_name, self.cache_dir)
        if target.exists():
            return target

        self._download_model(self.model_spec.download_url, target)
        return target

    def _download_model(self, url: str, target: Path) -> None:
        logger.info("Downloading local embedding model %s to %s", self.model_name, target)
        tmp_target = target.with_suffix(target.suffix + ".part")
        try:
            with requests.get(url, stream=True, timeout=(10, 300)) as response:
                response.raise_for_status()
                with tmp_target.open("wb") as fh:
                    for chunk in response.iter_content(chunk_size=1024 * 1024):
                        if chunk:
                            fh.write(chunk)
            os.replace(tmp_target, target)
        except Exception as exc:
            tmp_target.unlink(missing_ok=True)
            raise EmbeddingConfigurationError(
                f"Failed to download local embedding model '{self.model_name}' from {url} "
                f"to {target}: {exc}"
            ) from exc

    def _load_model(self):
        llama_cls = self._import_llama()
        try:
            return llama_cls(
                model_path=str(self._resolved_model_path),
                embedding=True,
                verbose=False,
            )
        except Exception as exc:
            raise EmbeddingConfigurationError(
                f"Failed to load GGUF embedding model from {self._resolved_model_path}: {exc}"
            ) from exc

    def _format_text(self, text: str, *, is_query: bool) -> str:
        if is_query and self.query_instruction:
            return f"{self.query_instruction}{text}"
        return text

    @staticmethod
    def _extract_embedding(payload: Any) -> List[float]:
        if isinstance(payload, dict):
            data = payload.get("data")
            if isinstance(data, list) and data:
                item = data[0]
                if isinstance(item, dict) and "embedding" in item:
                    return list(item["embedding"])
            if "embedding" in payload:
                return list(payload["embedding"])
        raise RuntimeError("Unexpected llama-cpp-python embedding response format")

    @staticmethod
    def _extract_embeddings(payload: Any) -> List[List[float]]:
        if isinstance(payload, dict):
            data = payload.get("data")
            if isinstance(data, list):
                vectors: List[List[float]] = []
                for item in data:
                    if not isinstance(item, dict) or "embedding" not in item:
                        raise RuntimeError(
                            "Unexpected llama-cpp-python batch embedding response format"
                        )
                    vectors.append(list(item["embedding"]))
                return vectors
        raise RuntimeError("Unexpected llama-cpp-python batch embedding response format")

    def embed(self, text: str, is_query: bool = False) -> EmbedResult:
        formatted = self._format_text(text, is_query=is_query)

        def _call() -> EmbedResult:
            payload = self._llama.create_embedding(formatted)
            return EmbedResult(dense_vector=self._extract_embedding(payload))

        try:
            result = self._run_with_retry(
                _call,
                logger=logger,
                operation_name="local embedding",
            )
        except Exception as exc:
            raise RuntimeError(f"Local embedding failed: {exc}") from exc

        estimated_tokens = self._estimate_tokens(formatted)
        self.update_token_usage(
            model_name=self.model_name,
            provider="local",
            prompt_tokens=estimated_tokens,
            completion_tokens=0,
        )
        return result

    def embed_batch(self, texts: List[str], is_query: bool = False) -> List[EmbedResult]:
        if not texts:
            return []

        formatted = [self._format_text(text, is_query=is_query) for text in texts]

        def _call() -> List[EmbedResult]:
            payload = self._llama.create_embedding(formatted)
            return [
                EmbedResult(dense_vector=vector) for vector in self._extract_embeddings(payload)
            ]

        try:
            results = self._run_with_retry(
                _call,
                logger=logger,
                operation_name="local batch embedding",
            )
        except Exception as exc:
            raise RuntimeError(f"Local batch embedding failed: {exc}") from exc

        estimated_tokens = sum(self._estimate_tokens(text) for text in formatted)
        self.update_token_usage(
            model_name=self.model_name,
            provider="local",
            prompt_tokens=estimated_tokens,
            completion_tokens=0,
        )
        return results

    def get_dimension(self) -> int:
        return self._dimension

    def close(self):
        close_fn = getattr(self._llama, "close", None)
        if callable(close_fn):
            close_fn()
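One way to address this concern (a sketch with a minimal stand-in class, not OpenViking's actual base class or method signatures) is to override the async methods explicitly and offload the blocking call with `asyncio.to_thread`, so llama-cpp inference never runs on the event-loop thread:

```python
import asyncio
from typing import List

class BlockingEmbedder:
    """Stand-in for LocalDenseEmbedder's synchronous llama-cpp calls."""

    def embed(self, text: str) -> List[float]:
        # Placeholder for a blocking llama_cpp.Llama.create_embedding call.
        return [float(len(text))]

    async def embed_async(self, text: str) -> List[float]:
        # Offload the blocking call so the event loop stays responsive.
        return await asyncio.to_thread(self.embed, text)

    async def embed_batch_async(self, texts: List[str]) -> List[List[float]]:
        return await asyncio.to_thread(lambda: [self.embed(t) for t in texts])

async def main() -> List[List[float]]:
    embedder = BlockingEmbedder()
    single = await embedder.embed_async("hello")
    batch = await embedder.embed_batch_async(["a", "bb"])
    return [single] + batch

# asyncio.run(main()) -> [[5.0], [1.0], [2.0]]
```

`asyncio.to_thread` runs the callable in the default thread-pool executor, which is usually sufficient for CPU-light dispatch into a C extension that releases the GIL; a dedicated executor would be the next step if the default pool becomes a bottleneck.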

@github-actions

PR Code Suggestions ✨

Explore these optional code suggestions:

Category: General
Add retries for model downloads

Add retry logic for model downloads using the existing _run_with_retry helper to
improve resilience against transient network errors.

openviking/models/embedder/local_embedders.py [145-161]

 def _download_model(self, url: str, target: Path) -> None:
     logger.info("Downloading local embedding model %s to %s", self.model_name, target)
     tmp_target = target.with_suffix(target.suffix + ".part")
-    try:
+
+    def _download():
         with requests.get(url, stream=True, timeout=(10, 300)) as response:
             response.raise_for_status()
             with tmp_target.open("wb") as fh:
                 for chunk in response.iter_content(chunk_size=1024 * 1024):
                     if chunk:
                         fh.write(chunk)
         os.replace(tmp_target, target)
+
+    try:
+        self._run_with_retry(
+            _download,
+            logger=logger,
+            operation_name="local model download",
+        )
     except Exception as exc:
         tmp_target.unlink(missing_ok=True)
         raise EmbeddingConfigurationError(
             f"Failed to download local embedding model '{self.model_name}' from {url} "
             f"to {target}: {exc}"
         ) from exc
Suggestion importance [1-10]: 6

Why: This improves resilience against transient network errors during model downloads by reusing the existing _run_with_retry helper, making the local embedder more robust.

Impact: Low

@Mijamind719 Mijamind719 force-pushed the embedding_local branch 2 times, most recently from 6e57688 to f6ff2a0 on April 13, 2026 03:39
@jcp0578
Contributor

jcp0578 commented Apr 13, 2026

A few findings:

  1. [P1] Existing collections without metadata will hard-block the upgrade path
    After create_collection() returns False, the old "collection already exists, keep running" behavior is no longer preserved: the existing collection must now expose readable embedding metadata, otherwise EmbeddingConfigurationError is raised immediately. Collections created by older versions, or storage backends that have not yet implemented get_collection_meta(), will fail to start after an upgrade even though they worked fine before. Consider a compatibility branch or an explicit migration path for legacy collections, or at least documenting this.
    File: openviking/storage/collection_schemas.py
    Lines: 225-232

  2. [P2] The local embedder's async path still blocks the event loop
    embed_compat() prefers embed_async(), but the base-class default implementation just runs self.embed(...) synchronously. LocalDenseEmbedder does not override the async methods, so llama-cpp-python inference, along with the related local initialization path, still executes on the event-loop thread through the async call chain. For server-side concurrent embedding/ingestion workloads this means significant latency amplification and availability risk. Recommend explicitly overriding the async interface and offloading the blocking calls to a thread pool.
    File: openviking/models/embedder/local_embedders.py
    Lines: 76-89

  3. [P2] Model identity computation failures silently degrade to name-only comparison
    This branch is meant to encode the local model file's identity into the collection metadata to make the embedding compatibility check more accurate, but the current implementation wraps the entire path-resolution and hashing step in a bare except Exception: and falls back to model_identity = model with no logging at all. Once identity computation fails, different local model files degrade back to "compare by model name only", weakening exactly the incompatibility detection this change adds. Recommend at least logging a warning and catching only the expected exception types.
    File: openviking/storage/collection_schemas.py
    Lines: 146-154
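The third point can be sketched as follows. This is a hypothetical `compute_model_identity` helper, not the code in `collection_schemas.py`: it narrows the caught exceptions and logs a warning before falling back to the bare model name.

```python
# Sketch: narrow exception handling + warning instead of a silent fallback.
import hashlib
import logging
from pathlib import Path
from typing import Optional

logger = logging.getLogger(__name__)

def compute_model_identity(model: str, model_path: Optional[str]) -> str:
    """Derive a model identity for embedding-compatibility checks.

    Hashes the local model file when possible so different GGUF files with
    the same model name are distinguishable; otherwise falls back to the
    model name, loudly.
    """
    if not model_path:
        return model
    try:
        data = Path(model_path).expanduser().resolve().read_bytes()
        digest = hashlib.sha256(data).hexdigest()[:16]
        return f"{model}@{digest}"
    except (OSError, ValueError) as exc:
        logger.warning(
            "Could not hash local model file %s; falling back to "
            "name-only identity: %s", model_path, exc
        )
        return model
```

Catching only `OSError`/`ValueError` keeps genuinely unexpected failures visible while still degrading gracefully for missing or unreadable files.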

@Mijamind719 Mijamind719 changed the title 【WIP】feat: add local llama-cpp embedding support feat: add local llama-cpp embedding support Apr 15, 2026
…mports, and config singleton deadlock fix

Made-with: Cursor
Co-authored-by: GPT-5.4 <noreply@openai.com>
@MaojiaSheng MaojiaSheng merged commit 3b0ac8a into volcengine:main Apr 15, 2026
10 of 11 checks passed
@github-project-automation github-project-automation bot moved this from Backlog to Done in OpenViking project Apr 15, 2026
qin-ctx pushed a commit that referenced this pull request Apr 15, 2026
* feat: local llama.cpp embedding support, setup wizard, lazy storage imports, and config singleton deadlock fix

Made-with: Cursor
Co-authored-by: GPT-5.4 <noreply@openai.com>

* fix: remove unsupported custom local gguf setup

---------

Co-authored-by: GPT-5.4 <noreply@openai.com>
qin-ctx added a commit that referenced this pull request Apr 15, 2026